An investigation of recurrent neural network architectures for statistical parametric speech synthesis
Abstract
In this paper, we investigate two recurrent neural network (RNN) architectures for statistical parametric speech synthesis (SPSS): the Elman RNN and the recently proposed clockwork RNN [1]. Deep neural networks have recently been used for SPSS, but they predict every frame independently of the previous predictions and hence require post-processing to ensure smooth evolution of the speech parameters. RNNs, on the other hand, are intuitively better suited to the task because they inherently model temporal dependencies, but their use has been limited by the difficulty of training them. Lately, techniques such as sparse initialization, Nesterov's accelerated gradient, gradient clipping and leaky integration (LI) have been shown to overcome this difficulty. We study the utility of these techniques for the SPSS task. In addition, we show that the clockwork RNN is equivalent to an Elman RNN with a particular form of LI. This perspective helps explain why a simple Elman RNN with LI units performs well on sequential tasks.
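To make two of the techniques named in the abstract concrete, here is a minimal NumPy sketch of an Elman RNN step with leaky-integrator hidden units, plus gradient clipping by norm. The abstract does not give the paper's exact equations, so the update rule below is one common formulation and should be read as an assumption; the function names, weight names (`W_in`, `W_rec`, `b`) and the leak rate `alpha` are hypothetical and chosen only for illustration.

```python
import numpy as np

def leaky_elman_step(x_t, h_prev, W_in, W_rec, b, alpha):
    """One step of an Elman RNN with leaky-integrator hidden units.

    Assumed update (a common form, not necessarily the paper's):
        h_t = (1 - alpha) * h_prev + alpha * tanh(W_in x_t + W_rec h_prev + b)
    alpha may be a scalar or a per-unit vector; alpha = 1 gives the
    plain Elman update, smaller alpha makes a unit change more slowly.
    """
    candidate = np.tanh(W_in @ x_t + W_rec @ h_prev + b)
    return (1.0 - alpha) * h_prev + alpha * candidate

def run_sequence(xs, W_in, W_rec, b, alpha):
    """Run the leaky Elman RNN over a sequence of frames, starting from h = 0."""
    h = np.zeros(W_rec.shape[0])
    states = []
    for x_t in xs:
        h = leaky_elman_step(x_t, h, W_in, W_rec, b, alpha)
        states.append(h)
    return np.stack(states)

def clip_gradient(grad, max_norm):
    """Rescale grad so that its L2 norm does not exceed max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad
```

With a per-unit `alpha`, different hidden units integrate their input at different rates, which gives some intuition for the multi-timescale behaviour that the abstract relates to the clockwork RNN; the precise equivalence is the paper's result and is not reproduced here.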
Similar resources
Acoustic Modeling in Statistical Parametric Speech Synthesis – from HMM to LSTM-RNN
Statistical parametric speech synthesis (SPSS) combines an acoustic model and a vocoder to render speech given a text. Typically decision tree-clustered context-dependent hidden Markov models (HMMs) are employed as the acoustic model, which represent a relationship between linguistic and acoustic features. Recently, artificial neural network-based acoustic models, such as deep neural networks, ...
Recurrent Neural Network Postfilters for Statistical Parametric Speech Synthesis
In the last two years, there have been numerous papers that have looked into using Deep Neural Networks to replace the acoustic model in traditional statistical parametric speech synthesis. However, far less attention has been paid to approaches like DNN-based postfiltering where DNNs work in conjunction with traditional acoustic models. In this paper, we investigate the use of Recurrent Neural...
Statistical Parametric Speech Synthesis Using Bottleneck Representation From Sequence Auto-encoder
In this paper, we describe a statistical parametric speech synthesis approach with unit-level acoustic representation. In conventional deep neural network based speech synthesis, the input text features are repeated for the entire duration of phoneme for mapping text and speech parameters. This mapping is learnt at the frame-level which is the de-facto acoustic representation. However much of t...
Unit Selection with Hierarchical Cascaded Long Short Term Memory Bidirectional Recurrent Neural Nets
Bidirectional recurrent neural nets have demonstrated state-of-the-art performance for parametric speech synthesis. In this paper, we introduce a top-down application of recurrent neural net models to unit-selection synthesis. A hierarchical cascaded network graph predicts context phone duration, speech unit encoding and frame-level logF0 information that serves as targets for the search of unit...
Contextual Representation using Recurrent Neural Network Hidden State for Statistical Parametric Speech Synthesis
In this paper, we propose to use hidden state vector obtained from recurrent neural network (RNN) as a context vector representation for deep neural network (DNN) based statistical parametric speech synthesis. While in a typical DNN based system, there is a hierarchy of text features from phone level to utterance level, they are usually in 1-hot-k encoded representation. Our hypothesis is that,...